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Abstract 

We study the problem of estimating a temporally varying coefficient 
and varying structure (VCVS) graphical model underlying nonstationary 
time series data, such as social states of interacting individuals or microar- 
ray expression profiles of gene networks, as opposed to i.i.d. data from an 
invariant model widely considered in current literature of structural esti- 
mation. In particular, we consider the scenario in which the model evolves 
in a piece-wise constant fashion. We propose a procedure that minimizes 
the so-called TESLA loss (i.e., temporally smoothed LI regularized re- 
gression), which allows jointly estimating the partition boundaries of the 
VCVS model and the coefficient of the sparse precision matrix on each 
block of the partition. A highly scalable proximal gradient method is 
proposed to solve the resultant convex optimization problem; and the 
conditions for sparsistent estimation and the convergence rate of both the 
partition boundaries and the network structure are established for the 
first time for such estimators. 



1 Introduction 

Networks are a fundamental form of representation of relational information 
underlying large, noisy data from various domains. For example, in a biolog- 
ical study, nodes of a network can represent genes in one organism and edges 
can represent associations or regulatory dependencies among genes. In a so- 
cial analysis, nodes of a network can represent actors and edges can represent 
interactions or friendships between actors. Exploring the statistical properties 
and hidden characteristics of network entities, and the stochastic processes be- 
hind temporal evolution of network topologies is essential for computational 
knowledge discovery and prediction based on network data. 

In many dynamical environments, such as a developing biological system, it is 
often technically impossible to experimentally determine the network topologies 
specific to every time point in a discrete time series. Resorting to computational 
inference methods, such as extant structural learning algorithms, is also difficult 
because for every model unique to a single time point, there exist as few as only 
a single snapshot of the nodal states distributed accordingly to the model in 
question. In this paper, we consider an estimation problem under a particular 
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dynamic context, where the model evolves piecewise constantly, i.e., staying 
structurally invariant during unknown segments of time, and then jump to a 
different structure. 

A popular technique for deriving the network structure from iid sample is 
to estimate a sparse precision matrix. The importance o f estimating precision 
matrices with zeros was recognized by (jDempsterl . 1 19T2T) who coined the term 
covariance selection. The elements of the precision matrix represent the as- 
sociations or conditional covariances between corresponding variables. Once a 
sparse precision matrix is estimated, a network can be drawn by connecting 
variables whose corresponding elements of the precision matrix are non-zero. 
Recent studies have shown that covariance selection methods based on the pe- 
nalized likelihood maximization can lead to a consistent esti mate of the net- 
work structure underlyi ng a Gaussian Markov Random Fields (IFan et al. . 2009; 



Ravikumar et al. , 2008|) . Moreover, a particular procedure for covariance selec- 



tion known as neighborhood selection, which is built on l\ norm regularized 
regression, can produce a consistent estimate of the network structure when 
the sample is assumed to follow a general Markov Ra ndom Field distribution 
whose structure corresponds t o the network in q u estion (jRavikumar et al. ■ l2009t 



Meinshausen and Buhlmann , l2006t iPeng et all l2009h . Specifically, a Markov 



Random Field (MRF) is a probabilistic graphical model defined on a graph 
G = (V,E), where V — {1, . . . ,p} is a vertex set corresponding to the set of 
random variables to be modeled (in this paper we call them nodes and variables 
interchangeably) , and E C V x V is the edge set capturing conditional indecen- 
cies among these nodes. Let X = (Xi, . . . , X p )' denote a p-dimensional random 
vector, whose elements are indexed by the nodes of the graph G. Under the 
MRF, a pair (a, b) is not an element of the edge set E if and only if the variable 
X a is conditionally independent of Xb given all the rest of variables Xy\s aib \, 
X a _L Xb\Xy\s a b\- A distribution over X can be defined by taking the fol- 
lowing log linear form that makes explicit use of the (presence and absence of 
edges in the) edge set: p(X) oc exp{J2^ a b ^ eV 9 ab X a X b }. When the elements 
of the random vector X are discrete, e.g., X e {0, 1} P , the model is referred 
to as a discrete MRF, sometimes known as an Ising model in statistics physics 
community; whereas when X is a continuous vector, the model is referred to 
as a Gaussian graphical model (GGM) because one can easily show that the 
p(X) above is actually a multivariate Gaussian. The MRF have been widely 
used for modeling data with graphical relational structures over a fixed set of 
entities ( Wainwright and JordanL 20081 iGetoor and Taskar . 2007 ). The vertices 
can describe entities such as genes in a biological regulatory network, stocks 
in the market, or people in society; while the edges can describe relationships 
between vertices, for example, interaction, correlation or influence. 

The statistical problem we concern in this paper is to estimate the struc- 
ture of the Gaussian graphical model from observed samples of nodal states in 
a dynamic world. Traditional methods handle this problem with the assump- 
tion that the samples are iid. Let T> = {xi,...,x„} be an independent and 
identically distributed sample according to a p-dimensional multivariate normal 
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distribution Af p (0, S), where X is the covariance matrix. Let O := denote 
the precision matrix, with elements (w a f>), 1 < a, ^ < P- Then one can obtain an 
estimator of the ft from V via optimizing a proper statistical loss function, such 
as likelihood or penalized likelihood. As mentioned earlier, the precision matrix 
fi encodes the conditional independence structure of the distribution and the 
pattern of the zero elements in the precision matrix define the structure of the 
associated graph G. There has been a dramatic growth of interest in recent 
literature in the problem of covariance selection, which deals with the graph 
estimation problem above. Existing works range from algorithmic development 
focusing on efficient estimation procedures, to theoretical analysis focusing on 
statistical guarantees of different estimators. We do not intend to give an ex- 
tensive overview of the literature here, but inte rested readers ca n follow the 
pointers bellow. In the classical literature (e.g. Lauritzenl Il996 ). procedures 



are developed for small dimensional graphs and commonly involve hypothe- 
sis testing with greedy selection of edges. More recent lit erature estimates the 
sparse pre c ision matrix by optim i zing penalized likelihoo d ( Yuan and Linl. 2007 ; 



Fan et all 120091; iBaneriee et all 120081 iRothman et all. l2008t iFriedman et al 



20081 iRavikumar et all l2008t iGuo et all l2010bl IZhou et al l 120081) or through 



neighborhood selection ( Meinshausen and Buhlmann , 2006 ; Peng et al. . 20091 : 



Guo et al. , 2010a; IWang et al. , 2009). where the structure of the graph is esti- 



mated by estimating the neighborhood of each node. Both of these approaches 
are suitable for high-dimensional problems, even when p»n, and can be effi- 
ciently implemented using scalable convex program solvers. 

Most of the above mentioned work assumes that a single invariant network 
model is sufficient to describe the dependencies in the observed data. However, 
when the observed data are not iid, such an assumption is not justifiable. For 
example, when data consist of microarray measurements of the gene expression 
levels collected throughout the cell cycle or development of an organism, dif- 
ferent genes can be active during different stages. This suggests that different 
distributions and hence different networks should be used to describe the depen- 
dencies between measured variables at different time intervals. In this paper, 
we are going to tackle the problem of estimating the structure of the GGM 
when the structure is allowed to change over time. By assuming that the pa- 
rameters of the precision matrix change with time, we obtain extra flexibility to 
model a larger class of distributions while still retaining the interpretability of 
the static GGM. In particular, as the coefficients of the precision matrix change 
over time, we also allow the structure of the underlying graph to change as well. 
This semi-parametric generalization of the parametric model is referred to as a 
varying coefficient varying structure (VCVS) model. 

Now, let {Xj}j e r n i £ MP be a sequence of n independent observations (we use 
[n] to denote the set {1, . . . , n}) from some p-dimensional multivariate normal 
distributions, not necessarily the same for every observation. Let be 
a disjoint partitioning of the set [n] where each block of the partition consists 
of consecutive elements, that is, S J ' fl = for j ^ f and S J = [n] and 
&> = [T,-_i : Tj] := {T^ U T^ X + 1, . . . , T, } - 1}. Let T := {T = 1< T x < . . . < 
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Tb = n + 1} denote the set of partition boundaries. We consider the following 
model 

Xi~tf p {0,IP), ieB j , (1) 

so that observations indexed by elements in B J are p-dimensional realizations 
of a multivariate normal distribution with zero mean and the covariance matrix 
S J = (cr^ b )a,6e[p]- Let O- 7 := (S 3 ) -1 denote the precision matrix with elements 
(io 3 ab )a,b£\p\- With the number of partitions, B, and the boundaries of partitions, 
T, unknown, we study the problem of estimating both the partition set {B J } 
and the non-zero elements of the precision matrices {fl?}^^] from the sam- 
ple {xi} ie [„j. Note that in this work we study a particular case of the VCVS 
model, where the coefficients are piece-wise constant functions of time. A sce- 
nario where t he coefficients are smoothly varying fu nctions of time has been 
considered in IZhou et al.l ([20081 ) for the GGM and in iKolar et all (|2010bl ) and 
Kolar and Xing ( 2009h for an Ising model. 



If the partitions {B^}j were known, the problem would be trivially reduced to 
the setting analyzed in the previous work. Dealing with the unknown partitions, 
together with the structure estimation of the model, calls for new methods. We 
propose and analyze a method based on time- coupled neighborhood selection, 
where the model estimates are forced to stay similar across time using a fusion- 
type total variation penalty and the sparsity of each neighborhood is obtained 
through the l\ penalty. Details of the approach are given in §2. 

The model in Eq. flU is related to the varying-coefficient models (e.g. 
lHastie and Tibshirantll993h with the coefficients being piece- wise constant func- 
tions. Varying coefficient regression models with piece-wise c onstant coefficie nts 
are also known as segmented multivaria te regression models ([Liu et al l ll997h or 
linear models with structural changes ( Bai and Perronl 1998| ). The structural 
changes are commonly determined through hypothesis testing and a separate 
linear model is fit to each of the estimated segments. In our work, we use the 
penalized model selection approach to jointly estimate the partition boundaries 
and the model parameters. 

Little work has been done so far to wards mode l ing dy namic networks and 
estimating changing precision matrices. IZhou et al.l (|2008l ) develops a nonpara- 
metric method for estimation of time- varying GGM, where x* ~ Af p (0, E(i)) 
and E(i) is smoothly chan ging over time. The p rocedure is based on the penal- 
ized likelihood approach of I Yuan and Linl (|2007l ) with the empirical covariance 
matrix obtained us i ng a kernel smoother. Our work is very different from the 
one of Zhou et al. ( 20081 ). since under our assumptions the network changes 
abruptly rather than smoothly. Furthermore, as we outline in $21 our estima- 
tion procedure is not based on the penalized like lihood approach. Estim ation 
of time- vary i ng Isin g models has been d iscuss ed in I Ahmed and Xing (2009) and 



Kolar et al.l (|2010bT ). lYin et al.l (|2008j ) and IKolar et al.l (|2010al ) studied non- 



parametric ways t o esti mate the conditional covariance matrix. The work of 
Ahmed and Xin d (2009; is most similar to our setting, where they also use a 
fused-type penalty combined with an l\ penalty to estimate the structure of 
the verying Ising model. Here, in addition to focusing on GGMs, there is an 
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additional subtle, but important, difference to lAhmed and Xind (l2009h . In this 
work, we use a modification of the fusion penalty (formally described in fj2j 
which allows us to characterize the model selection consistency of our estimates 
and the convergence properties of the estimated partition boundaries, which is 
not available in the earlier work. 

The remaining of the paper is organized as follows. In <J21 we describe our 
estimation procedure and provide an efficient first-order optimization procedure 
capable of estimating larg e graphs. The o ptimization procedure is based on 
the smoothing procedure of lNesterov (2005) and converges in 0(l/e) iterations, 
where e is the desired accuracy. Our main theoretical results are presented in JJJ 
In particular, we show that the partition boundaries are estimated consistently. 
Furthermore, the graph structure is consistently estimated on every block of the 
partition that contains enough samples. Numerical studies showing the finite 
sample performance of our procedure are given in 2) The proofs of the main 
results are relegated to wfth some technical details presented in Appendix. 



Notation Schemes: 

For clarity, we end the introduction with a summary of the notations used in 
the paper. We use [n] to denote the set {l,...,n} and [I : r] to denote the 
set {I, I + 1, . . . , r — 1}. We use B^ to denote j-th block of the partition T. 
With some abuse of notation, we also use B^ to denote the set p}-i : Tj]. 
The number of samples in the block B^ is denoted as \B^\. For a set S C V, 
we use the notation Xg to denote the set {X a : a £ S} of random variables. 
We use X to denote the n x p matrix whose rows consist of observations. The 
vector X a = (xi i0 , . . . , x nia )' denotes a column of matrix X and, similarly, 
Xs = (Xfe : b £ S) denotes the n x \S\ sub- matrix of X whose columns are 
indexed by the set S and X 0J denotes the sub- matrix x p whose rows are 
indexed by the set BK For simplicity of notation, we will use \a to denote the 
index set [p]\{a}, X\ a = (X& : b £ [p]\{a}). For a vector a £ MP, we let 
5(a) denote the set of non-zero components of a. Throughout the paper, we 
use ci , C2 , . . . to denote positive constants whose value may change from line to 

line. For a vector a £ M™, define ||a||i = J2ie[ n ] l a «l' H a ll2 = \Jj2ie[ n ] a i an d 
Halloo = maxi |aj|. For a symmetric matrix A, A m i n (A) denotes the smallest and 
A max (A) the largest eigenvalue. For a matrix A (not necessarily symmetric), 
we use II A Hoc = maxjj^. |Ay|. For two vectors a, b £ M™, the dot product is 
denoted (a, b) = ^ e r n i aibi. For two matrices A, B £ R" xm , the dot product 
is denoted as ((A, B)) = tr(A'B). Given two sequences {a n } and {&„}, the 
notation a n = 0(b n ) means that there exists a constant c\ such that a n < 
C\b n ; the notation a n — £l(b n ) means that there exists a constant C2 such that 
a n > c 2^n and the notation a n x b n means that a n — 0(b n ) and b n = 0(a n ). 
Similarly, we will use the notation a n = o p (b n ) to denote that 6~ 1 a„ converges 
to in probability. 
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Table 1: Summary of symbols used throughout the paper 



Symbol 


Meaning 


Example 


\n\ 
I'M 


neon to nonnto tho Qpt 41 nY 

LI O V-L L VJ V_l V_ 1 1 \_f U\j ulllj Q V | 1^ • a • - f 




[h ■■ h] 


used to denote the set {ti, t\ + 1, . . . , t<i — 1} 




i 


used for indexing related to samples 


or B a - 




iicci/H i cw innovinrf ro 1 ci i~f*r\ 11 c 1/" 
LlOUU. 1UI lllLil^JVlllJi lLlcLLH_l LU U1UL.JS. 


a 


n h 


1 1 gpH rAr inn pvi n cr tiaHaq i ti fi ov 3 n 

USl_.*_i 1U1 IHULAIH^ IIULILB 111 CI gl dJJll 




G 


tViP PT?)r>ri fOTmistinP" of vortiros and oHp'os 

Ui-lt- til OiL/ll VjUllOliJ UlllSL, VJL \VjL UIIjVjiJ CI 11 LI ^ <L1£L L^iJ 


G — (V E) 


y 


tnp opt p,t" nnrlps in a orfiTtn 
uiic- oc-ij kj± iiu \_n_<o 111 a clicxl/ii 


v — y 




the apf nf pHq-pg q+ time 7 




X 


tno cnTnTinnoTit nf 'A vAYiclcwn "5C inrlpYPrl l~iv trip votIpy n 




l J -,i 


trip vpptov at ypO'T'pQ'^in'n pnpfnpip'ntc. tc\t Q^TiTnlp 1 

L11C VLljlUl \JL ILtlVjODlUll LUclllLlC 1 1 U ij Ivll OG1111L/H3 




ga,j 


the vector of repression coefficients for block" i 




T 


the set of partition boundaries 






the set of boundary fractions 






an index set for the samples in the partition j 


B j c [n] 


B 


denotes the number of partitions 




s i 


the set of neighbors of node a in block j 




s{d a ^) 


the set of non-zero elements of aj 




s j 


the closure of S 3 a 


Si = SlU {a} 




nodes not in the neighborhood of the node a in block j 


Nl = \p]\Si 


\a 
H 


the set of all vertices excluding the vertex a 
cardinality of a set or absolute value 


\a=\p}\{a} 




the covariance matrix 






an element of the covariance matrix 




n 


the precision matrix 






an element of the precision matrix 




(;■) 


the dot product 


(a, b) = a'b 


({;■)) 


the dot product between matrices 


«A,B» =tr(A'B) 
\\0 a 'i - 9^-% > Uin 


£min 


the minimum change between regression coefficient 


^min 


the minimum size of a coefficient 


\0b'' J 1 — ^min 


^min 


the minimum size of a block 


\&\ > A min 



G 



2 Graph estimation via Temp oral- Difference Lasso 



In this section, we introduce our time-varying covariance selection procedure, 
which is based on the time-coupled neighborhood selection using the fused-type 
penalty. We call the proposed procedure Temporal-Difference Lasso ( TD-Lasso). 
We start by reviewing the basic neighborhood selection pr ocedure, wh i ch ha s 
previously been used to estimate g raphs in, for example, Peng et all d2009l), 



Meinshausen and Buhlmann ( 2006 ) , Ravikumar et al. ( 20091) and Guo et alT i 2010a ) 



We start by relating the elements of the precision matrix ft to a regression 
problem. Let the set S a to denote the neighborhood of the node a. Denote S a 
the closure of S a , S a := S a U{a}, and N a the set of nodes not in the neighborhood 
of the node a, N a = [p]\S a . It holds that X a _L X^ a \X$ a . The neighborhood of 
the node a can be easily seen from the non-zero pattern o f the elem e nts in the 
precision matrix $7, S a = {b G [p]\{a} : uj a b ^ 0}. See Lauritzen (|l996l) for 



more details. It is a well known result for Gaussian graphical models that the 
elements of 

6 a = argmin E(X a - V X b 9 b ) 2 



are given by 9% = —uj a b/u aa . Therefore, the neighborhood of a node a, S a , is 
equal to the set of non-zero coefficients of 6 a . Using the expression for 9 a , we 
can write X a = X)&es ^b0 b + e, where e is independent of X\ a . 

The neighborhood selection procedure was motivated by the above relation- 
ship between the regression coefficie nts and the elements of the precision matrix. 
iMeinshausen and Buhlmannl ([2006) proposed to solve the following optimization 
procedure 

a = argmin -||X Q - X\ a 0\\ 2 2 + X\\9\\i (2) 

and proved that for iid sample the non-zero coefficients of a consistently esti- 
mate the neighborhood of the node a, under a suitably chosen penalty parameter 
A. 

In this paper, we build on the neighbourhood selection procedure to esti- 
mate the changing graph structure in model ([1]). We use S° a to denote the 
neighborhood of the node a on the block S J and 2Vjj to denote nodes not in 
the neighborhood of the node a on the j-th block, = V\S[. Consider the 
following estimation procedure 

$ a = argmin C(/3) + pen Al , Aj (/3) (3) 
where the loss is defined for (3 = {j3b^)b^\p-i} ,ie[n] as 

C(0) := ( x ^ - E ■'■'■>■ : '-) ( 4 ) 

i£[n] ^ b£\a ' 
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and the penalty is defined as 



Pen Al , A2 (/3) := 2A X £ ||/3., ( - /3.,*_i|| 2 + 2A 2 £ £ |/3 M |. (5) 

i=2 i=l 6e\a 

The penalty term is constructed from two terms. The first term ensures that 
the solution is going to be piecewise constant for some partition of [n] (possibly 
a trivial one). The first term can be seen as a sparsity inducing term in the 
temporal domain, since it penalizes the difference between the coefficients (3.^ 
and /3..i+i at successive time-points. The second term results in estimates that 
have many zero coefficients within each block of the partition. The estimated 
set of partition boundaries 

t = {f Q = 1} U {fj G [2 : n] : /§° fj ^ $ a f , _J U {f & = n + 1} 

contains indices of points at which a change is estimated, with B being an 
estimate of the number of blocks B. The estimated number of the block B is 
controlled through the user defined penalty parameter Ai, while the sparsity of 
the neighborhood is controlled through the penalty parameter A 2 . 

Based on the estimated set of partition boundaries T, we can define the 
neighborhood estimate of the node a for each estimated block. Let aj = /3 a i; 
Vi G [Tj-i : Tj] be the estimated coefficient vector for the block & = \Tj-i ■ %\- 
Using the estimated vector aj , we define the neighborhood estimate of the node 
a for the block BP as 

Si:=S(0 a >i):={be\a : ^> 0}. 

Solving ([3]) for each node a G V gives us a neighborhood estimate for each node. 
Combining the neighborhood estimates we can obtain an estimate of the graph 
structure for each point i G [n]. 

The choice of the penalty term i s motivated by the work on pen alization 
using total variation (Rinaldo, 2009; iMammen and van de Geerl . Il997l) . which 



results in a piece-wise constant approximation of an unknown regression func- 
tion. The fusio n-penalty has also been applied in the context of multivariate 
linear regression lTibshirani et aL ( 2005 ). where the coefficients that are spatially 



close, are also biased to have similar values. As a result, nearby coefficients are 
fused to the same estimated value. Instead of penalizing the t\ norm on the 
difference between coefficients, we use the £2 norm in order to enforce that all 
the changes occur at the same point. 

The objective estimates the neighborhood of one node in a graph for 
all time-points. After solving the objective ([3]) for all nodes a € V, we need to 
combine them to obtain the graph structure. We will use the following procedure 
to combine {/3 a } aS y, 

% = {(a, b) : max(|/3 fc a J, |/3^J) > 0}, i G [n]. 

That is, an edge between nodes a and b is included in the graph if at least 
one of the nodes a or b is included in the neighborhood of the other node. We 
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use the max operator to combine different neighborhoods as we believe that 
for the purpose of network exploration it is more important to occasionally 
include spurious edges than to omit relevant ones. For further discussion on the 
differences between the min and the max combination, we refer an interested 



reader to IBaneriee et al.l (|200 



2.1 Numerical procedure 

Finding a minimizer (3 a of can be a computationally challenging task for an 
off-the-shelf convex optimization procedure. We propose too u se an accelerated 
gradient method with a smoothing technique (lNesterovLl2005h . which converges 



in C(l/e) iterations where e is the desired accuracy. 

We start by defining a smooth approximation of the fused penalty term. Let 
H £ E nxn_1 be a matrix with elements 

- 1 if i = j 
H l0 = { 1 if i = j + 1 
otherwise. 



With the matrix H we can rewrite the fused penalty term as 2A i V!"-/ IK/3H). 



and u sing the fact that the £2 norm is self dual (for example, see lBovd and Vandenberghe 



i\\2 



20041) we have the following representation 



2A X ^ - j8.,<-i||2 = max ((U, 2Ai/3H» (6) 

i=2 e 

where Q := {U e jjp-ixn-i . ||u.^|j 2 < 1, Vi G [n — 1]}. The following 
function is defined as a smooth approximation to the fused penalty, 

* P 09) := max «U,2Ai/JH)) - /i||U||| (7) 
where /i > is the smoothness parameter. It is easy to see that 

< *o(/3) < + - !)• 

Setting the smoothness parameter to /1 = 2 (n~i) > ^ ne correct rate of convergence 
is ensured. Let U M (/3) be the optimal solution of the maximization problem in 
0, which can be obtained analytically as 

u,(/3) = n Q (^T) (8 ) 

where IIo(-) is the projection operator onto the set Q. From Theorem 1 in 



NesterovP 2005 ) . we have that ^ M (/3) is continuously differentiable and convex, 
with the gradient 

Vtf M (/3) = 2A 1 U (U (/3)H' (9) 
that is Lipschitz continuous. 
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With the above defined smooth approximation, we focus on minimizing the 
following objective 

min F(f3):= min C{0) + *„(/3) + 2A 2 ||^|| 1 . 



Following iBeck and Teboulld (|2009h (see also iNesterovl (|2007h ). we define the 



following quadratic approximation of F(f3) at a point Po 

Ql(P,Po) := C(p ) + * M (/3o) + ({/3 - A), V£(/9 ) + W(/3 )» 

L (10) 
+ gIliS- A)lH- + 2A 2 ||i9||i 

where L > is the parameter chosen as an upper bounds for the Lipschitz 
constant of V£ + V^. Let Pl(Po) be a minimizer of Ql(P,Pq). Ignoring 
constant terms, pl(Po) can be obtained as 

2 



Pl(Po) = argmin \ 



F L 



2X2 "PWr 



It is clear that Pl{Po) is the unique minimizer, which can be obtained in a closed 
form, as a result of the soft-thresholding, 

Pl (P ) = t(po-^(V£ + V*)(Po)^) (11) 

where T(x, A) = sign(x) max(0, \x\ — A) is the soft-thresholding operator that is 
applied element- wise. 

In practice, an upper bound on the Lipschitz constant of V£ + can be 
expensive to compute, so the parameter L is going to be determined iteratively. 
Combining all of the above, we have the following algorithm. In the algo- 
rithm, 7 is a constant used to increase the estimate of the Lipschitz constant 
L. Compared to the gradient descent method (which can be obtain by iterating 
Pk+i = PL{Pk)), the accelerated gradient method updates two sequences {Pk} 
and {zfe} recursively. Instead of performing the gradient step from the latest 
approximate solution Pf., the gradient step is performed from the search point 
Zfc that is obtained as a linear combination of the last to approximate solutions 
Pk-i and p k . Since the condition F(p L (z k )) < QL{PL( z k), z k) is satisfied in 
every iteration, we have t he algorithm converges in 0(1/ e) iterations following 
Beck and Teboulld (|2009h . As the convergence criterion, we stop iterating once 



the relative change in the objective value is below some threshold value. 
2.2 Tuning parameter selection 

The penalty parameters Ai and A2 control the complexity of the estimated 
model. In this work, we propose to use the BIC score to select the tuning 
parameters. Define the BIC score for each node a £ V as 

BIC a (A 1 ,A 2 ):=log^ + ^ ]T (12) 

U MB] 
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Algorithm: Accelerated Gradient Method for Equation ([3]) 

Input: X e K" xp , f3 G W~ lxn , 7 > 1, L > 0, ft = 
Output: a 



Initialize k := 1, := 1, z*. := /3 
repeat 

while F(p L (z k )) > Q L (p L (z k ). z k ) do 
I L:=jL 

(3 k :=p L {z k ) (using Eq. (TTT])) 




z fe+1 :=/3 fe + ^1(^-/3^) 




where £(•) is denned in (0} and a = a (Xi,X 2 ) is a solution of The penalty 
parameters can now be chosen as 



We will use the above formula to select the tuning parameters in our simulations, 
where we are going to search for the best choice of parameters over a grid. 

3 Theoretical results 

This section is going to address the statistical properties of the estimation pro- 
cedure presented in Section [2j The properties are addressed in an asymptotic 
framework by letting the sample size n grow, while keeping the other parameters 
fixed. For the asymptotic framework to make sense, we assume that there exists 
a fixed unknown sequence of numbers {r, } that defines the partition boundaries 
as Tj = [riTj\ , where [a\ denotes the largest integer smaller that a. This assures 
that as the number of samples grow, the same fraction of samples falls into every 
partition. We call {tj} the boundary fractions. 

We give sufficient conditions under which the sequence {tj} is consistently 
estimated. In particular, if the number of partition blocks is estimated correctly, 
then we show that maXj S [ B ] \Tj — Tj\ < nd n with probability tending to 1, where 
{Sn}n is a non-increasing sequence of positive numbers that tends to zero. If 
the number of partition segments is over estimated, then we show that for a 
distance defined for two sets A and B as 




(13) 



h{A,B) := sup inf \a - b\ 



(14) 
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we have h(7~,T) < nS n with probability tending to 1. With the boundary 
segments consistently estimated, we further show that under suitable conditions 
for each node a £ V the correct neighborhood is selected on all estimated block 
partitions that are sufficiently large. 

The proof technique employed in this section is quite involved, so we briefly 
describe the steps used. Our analysis is based on careful inspection of the op- 
timality conditions that a solution (3 a of the optimization problem ([3]) need to 
satisfy. The optimality conditions for (3 a to be a solution of ([3]) are given in W6.2\ 
Using the optimality conditions, we establish the rate of convergence for the par- 
tition boundaries. This is done by proof by contradiction. Suppose that there 
is a solution with the partition boundary T that satisfies h(7~,T) > n5 n . Then 
we show that, with high-probability, all such solutions will not satisfy the KKT 
conditions and therefore cannot be optimal. This shows that all the solutions to 
the optimization problem @ result in partition boundaries that are "close" to 
the true partition boundaries, with high-probability. Once it is established that 
T and T satisfy h(T,T) < nS n , we can further show that the neighborhood 
estimates are consistently estimated, under the assumption that the estimated 
blocks of the partition have enough samples. This part of the analysis fol- 
lows the commonly us e d strategy to prove that the Lasso is sparsistent ( see fo r 
example iBunea , 2008; IWainwrightl . 2009; Mci nshausen and Buhlmannl . 120061) . 
however important modifications are required due to the fact that position of 
the partition boundaries are being estimated. 

Our analysis is going to focus on one node a £ V and its neighborhood. 
However, using the union bound over all nodes in V, we will be able to carry 
over conclusions to the whole graph. To simplify our notation, when it is clear 
from the context, we will omit the superscript a and write /3, 6 and S, etc., to 
denote $ a , a and S a , etc. 



3.1 Assumptions 

Before presenting our theoretical results, we give some definitions and assump- 
tions that are going to be used in this section. Let A m ; n :— minj G [ B ] \Tj — Tj_i| 
denote the minimum length between change points, £ m i n := min ae y min je [s-i] | \0' 
a ^ || 2 denote the minimum jump size and $min — min a £y min^^] min^gi I 
the minimum coefficient size. Throughout the section, we assume that the fol- 
lowing holds. 

Al There exist two constants 4> m - m > and max < oo such that 
^mi„ = min {A min (S J ) : j e [B],a G V} 

and 

'/'max = max {A max (S- J ) : j € [B], a € V}. 
A2 Variables are scaled so that a 3 aa = 1 for all j £ [B] and all a £ V. 
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The assumption Al is commonly used to ensure that the model is identifiable. 
If the population covariance matrix is ill-conditioned, the question of the correct 
model identification if not well defined, as a neighborhood of a node may not 
be uniquely defined. The assumption A2 is assumed for the simplicity of the 
presentation. The common variance can be obtained through scaling. 

A3 There exists a constant M > such that 

max max ||0°' fe - Q ' J '|| 2 < M. 

aGV j,k£[B] 



The assumption A3 states that the difference between coefficients on two dif- 
ferent blocks, ||<? a,fc — aj ||2, is bounded for all j,k € [B]. This assumption is 
simply satisfied if the coefficients 9 a were bounded in the £2 norm. 

A4 There exist a constant a £ (0, 1], such that the following holds 
m f* l S iv^( S sisi)~i« < 1 - Va G V. 



The assumption A4 states that the variables in the neighborhood of the node a, 
Si, are not too correlated with the variables in the set N^. This assump- 
tion is necessary and sufficient for correct identification of the relevan t vari- 



ables in the Lasso regression pro blems (see for example I Zhao and Yul . 12006 



van de Geer and Buhlmann , 20091 ). Note that this condition is sufficient also in 



our case when the correct partition boundaries are not known. 

A5 The minimum coefficient size 9 m i n satisfies 6 min = £l(y/log(n)/n). 

The lower bound on the minimum coefficient size 9 m i n is necessary, since if a 
partial correlation coefficient is too close to zero the edge in the graph would 
not be detectable. 



A6 The sequence of partition boundaries {Tj} satisfy Tj = \ nTj\, where {tj} 
is a fixed, unknown sequence of the boundary fractions belonging to [0, 1]. 

The assumption is needed for the asymptotic setting. As n — > 00, there will 
be enough sample points in each of the blocks to estimate the neighborhood of 
nodes correctly. 



3.2 Convergence of the partition boundaries 

In this subsection we establish the rate of convergence of the boundary partitions 
for the estimator (j3|). We start by giving a lemma that characterizes solutions 
of the optimization problem given in ((3]). Note that the optimization problem 
in ([3]) is convex, however, there may be multiple solutions to it, since it is not 
strictly convex. 
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Lemma 1. A matrix (3 is optimal for the optimization problem ^ if and only 
if there exist a collection of subgradient vectors {zi} i6 [ 2: n] and {yi}ig[ n ], with 
Zj G 9||/3.,< — y9.,i-i||2 and yi e d\\f3.,i\\i, that satisfies 

n n n 

^2xi,\a(Xi,\a, ~ -^2*iAa£* + AlZfc + A 2 ^ y« = (15) 

for all k £ [n] and i\ = z„ + i = 0. 

The following theorem provides the convergence rate of the estimated bound- 
aries of T, under the assumption that the correct number of blocks is known. 

Theorem 2. Let {xi}j g [„j be a sequence of observation according to the model in 
|T]). Assume that A1-A3 and A5-A6 hold. Suppose that the penalty parameters 

Ai and A 2 satisfy 

AixA 2 = 0(y/log(n)/n). (16) 

Let {/3-,i}ig[ n ] be any solution of ([3]) and let T be the associated estimate of the 
block partition. Let {S n } n >i be a non-increasing positive sequence that converges 
to zero as n — > oo and satisfies A m ; n > n5 n for all n > 1. Furthermore, suppose 
that (7i<J n £ min ) _1 Ai -> 0, CiinV^ -> and (£ min VnS^) ~ 1 yjp log n -> 0, i/ien 
i/ \T\ = B + 1 f/ie following holds 

P[max IT,- - f,-| < twJJ -^=^> i. 

Suppose that 5 n = (log n) 1 /n for some 7 > 1 and £ m i n = f2(ylogn7 (l°g n ) 7 )i 
the conditions of theorem 5 are satisfied, and we have that the sequence of 
boundary fractions {rj} is consistently estimated. Since the boundary fractions 
are consistently estimated, we will see below that the estimated neighborhood 
S(9 : ') on the block consistently recovers the true neighborhood 5 J '. 

Unfortunately, the correct bound on the number of block B may not be 
known. However, a conservative upper bound i? max on the number of blocks 
B may be known. Suppose that the sequence of observation is over segmented, 
with the number of estimated blocks bounded by i? max - Then the following 
proposition gives an upper bound on h(T,T) where h{-, •) is defined in (1141) . 

Proposition 3. Let {xi} ie [ n ] be a sequence of observation according to the 
model in (JXJ) . Assume that the conditions of theorem® are satisfied. Let (3 be 
a solution of <|3j> and 7 the corresponding set of partition boundaries, with B 
blocks. If the number of blocks satisfy B < B < B mayi , then 

W[h(f,T) < nS„] 1. 

The proof of the proposition follows the same ideas of theorem [5] and its 
sketch is given in the appendix. 

The above proposition assures us that even if the number of blocks is overesti- 
mated, there will be a partition boundary close to every true unknown partition 
boundary. 
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Figure 1: The figure illustrates where we expect to estimate a neighborhood of a 
node consistently. The blue region corresponds to the overlap between the true 
block (bounded by gray lines) and the estimated block (bounded by black lines). 
If the blue region is much larger than the orange regions, the additional bias 
introduced from the samples from the orange region will not considerably affect 
the estimation of the neighborhood of a node on the blue region. However, we 
cannot hope to consistently estimate the neighborhood of a node on the orange 
region. 

3.3 Correct neighborhood selection 

In this section, we give a result on the consistency of the neighborhood esti- 
mation. We will show that whenever the estimated block & is large enough, 
say \&\ > r n where {r„}„>i is an increasing sequence of numbers that satisfy 
(?"nA2) _1 Ai — > and r„A| — > oo as n — > oo, we have that Sfo) = S(/3 k ), 
where (3 k is the true parameter on the true block B k that overlaps & the most. 
Figure Q] illustrates this idea. The blue region in the figure denotes the overlap 
between the true block and the estimated block of the partition. The orange 
region corresponds to the overlap of the estimated block with a different true 
block. If the blue region is considerably larger than the orange region, the bias 
coining from the sample from the orange region will not be strong enough to 
disable us from selecting the correct neighborhood. On the other hand, since the 
orange region is small, as seen from Theorem^ there is little hope of estimating 
the neighborhood correctly on that portion of the sample. 

Suppose that we know that there is a solution to the optimization prob- 
lem ([3]) with the partition boundary T . Then that solution is also a minimizer 
of the following objective 

B B 

min y ||xf -Xf>J'||l + 2Ai V||^'-^- 1 || 2 +2A 2 V|B i |||0 J '||i. (17) 
e\...,e 6 ^ x *r-t r-f 

jeB J= 2 j=i 

Note that the problem (IT7|) does not give a practical way of solving ([3]), but 
will help us to reason about the solutions of ([3]). In particular, while there 
may be multiple solutions to the problem ([3]), under some conditions, we can 
characterize the sparsity pattern of any solution that has specified partition 
boundaries T. 

Lemma 4. Let f3 be a solution to ([3]), with T being an associated estimate of 
the partition boundaries. Suppose that the subgradient vectors satisfy \yi_b\ < 1 
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for all b (ji 5'(/3. j i), then any other solution (3 with the partition boundaries T 
satisfy ftb.i = for all b ^ S(f3. t i). 

The above Lemma states sufficient conditions under which the sparsity pat- 
ter of a solution with the partition boundary T is unique. Note, however, that 
there may other solutions to © that have different partition boundaries. 

Now, we are ready to state the following theorem, which establishes that the 
correct neighborhood is selected on every sufficiently large estimated block of 
the partition. 

Theorem 5. Let {xi} ie [„] be a sequence of observation according to the model 
in (JTJ) . Assume that the conditions of theorem [H are satisfied. In addition, 
suppose that A 4 also holds. Then, if \T\ = B + 1, it holds that 

p[s k = s(e k )] i, vfc g [b]. 

Under the assumptions of theorem [5] each estimated block is of size 0(n). 
As a result, there are enough samples in each block to consistently estimate the 
underlying neighborhood structure. Observe that the neighborhood is consis- 
tently estimated at each i g B 3 ' D B 3 for all j € [B] and the error is made only 
on the small fraction of samples, when i ^ B 3 n B 3 , which is of order 0(n5 n ). 

Using proposition [3] in place of theorem [3J it can be similarly shown that, 
for a large fraction of samples, the neighborhood is consistently estimated even 
in the case of over-segmentation. In particular, whenever there is a sufficiently 
large estimated block, with \B k n B^\ = 0(r n ), it holds that S(B k ) = 5 J with 
probability tending to one. 

4 Numerical studies 

In this section, we present a small numerical study on the proposed algorithm 
on simulated networks. A full performance test and application on real world 
data is beyond the scope of this paper which mainly focuses on the theory of 
time-varying model estimation. In all of our simulations studies we set p = 30 
and B = 3 with |B X | = 80, \B 2 \ = 130 and \B 3 \ = 90, so that in total we 
have n = 300 samples. We consider two types of random networks: a chain 
and a nearest neighbor network. We measure the performance of the estimation 
procedure outlined in Sj2]on the following metrics: average precision of estimated 
edges, average recall of estimated edges and average Fi score which combines 
the precision and recall score. The precision, recall and F± score are respectively 
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Figure 2: A chain graph 



defined as 



precision 



recall 



n ^f, 

te [n] 



E ae[p i ELa+i ^{(a, b) £ Ei A (a, 6) G 



i£[n] ^a£[p] 

2 * precision * recall 
precision + recall 



EaewELo+i I{(a,6)G^i} 



Our results are averaged over 50 simulation runs. We compare our algorithm 
against an oracle algorithm which exactly knows t he true partition boundaries. 
In this case, it is only needed to run the algorithm of lMeinshausen and Biihlmann 
( 2006) on each block of the partition independently. We use a BIC criterion to 



select the tuning parameter for this oracle procedure as described in lPeng et al 
<|2009h . 



(|2009h to 



generate 



Chain networks. We follow the simulation in lFan et al. 
a chain network (see Figure [2]). This network corresponds to a tridiagonal 
precision matrix (after an appropriate permutation of nodes). The network is 
generated as follows. First, we choose generate a random permutation ir of [n]. 
Next, the covariance matrix is generated as follows: the element at position 



. < t p and 



exp(— \t 



7r(a) C 7r(f>) 



*tt(6)I/2) where t x < t 2 < 



(a, b) is chosen as <r a {, 
ti — U-i ~ Unif(0.5, 1) for i — 2, . . . ,p. This processes is repeated three times 
to obtain three different covariance matrices, from which we sample 80, 130 and 
90 samples respectively. 

For illustrative purposes, Figure [3] plots the precision, recall and F\ score 
computed for different values of the penalty parameters Ai and A2. Table [2] 
shows the precision, recall and F\ score for the parameters chosen using the 
BIC score described in !2.2l The numbers in parentheses correspond to standard 
deviation. Due to the fact that there is some error in estimating the parti- 
tion boundaries, we observe a decrease in performance compared to the oracle 
procedure that knows the correct position of the partition boundaries. 

Nearest neighbors network s. We generate n earest neighbor networks 
following the procedure outlined in lLi and Gui (l2006h . For each node, we draw 



Table 2: Performance on chain networks 



Method name 


Precision 


Recall 


-Fi score 


TD-Lasso 
Oracle procedure 


0.84 (0.04) 
0.97 (0.02) 


0.80 (0.04) 
0.89 (0.02) 


0.82 (0.04) 
0.93 (0.02) 
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Figure 3: Plots of the precision, recall and Fx scores as functions of the penalty 
parameters Ai and A2 for chain networks. The parameter Ai is obtained as 
100 * 0.98 50+ \ where i indexes y-axis. The parameter A2 is computed as 285 * 
0.98 230+J , where j indexes x-axis. The white region of each plot corresponds to 
a region of the parameter space that we did not explore. 



a point uniformly at random on a unit square and compute the pairwise dis- 
tances between nodes. Each node is then connected to 4 closest neighbors (see 
FigureS]). Since some of nodes will have more than 4 adjacent edges, we remove 
randomly edges from nodes that have degree larger than 4 until the maximum 
degree of a node in a network is 4. Each edge (a, b) in this network corresponds 
to a non-zero element in the precision matrix whose value is generated uni- 
formly on [—1,-0.5] U [0.5,1]. The diagonal elements of the precision matrix 
are set to a smallest positive number that makes the matrix positive definite. 
Next, we scale the corresponding covariance matrix X = fi -1 to have diago- 
nal elements equal to 1. This processes is repeated three times to obtain three 
different covariance matrices, from which we sample 80, 130 and 90 samples 
respectively. 

For illustrative purposes, Figure [5] plots the precision, recall and Fx score 
computed for different values of the penalty parameters Ai and A2. Tabled] 
shows the precision, recall and Fx score for the parameters chosen using the 
BIC score, together with their standard deviations. In the same table, we give 
the results of the oracle procedure. 
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Figure 4: An instance of a random neighborhood graph with 30 nodes. 



5 Conclusion 

We have addressed the problem of time-varying covariance selection when the 
underlying probability distribution changes abruptly at some unknown points 
in time. Using a penalized neighborhood selection approach with the fused-type 
penalty, we are able to consistently estimate times when the distribution changes 
and the network structure underlying the sample. The proof technique used to 
establish the convergence of the boundary fractions using the fused-type penalty 
is novel and constitutes an important contribution of the paper. Furthermore, 
our procedure estimates the network structure consistently whenever there is 
a large overlap between the estimated blocks and the unknown true blocks of 
samples coming from the same distribution. The proof technique used to estab- 
lish the consistency of the network structure builds on the proof for consistency 
of the neighborhood selection procedure, however, important modifications are 
necessary since the times of distribution changes are not known in advance. Ap- 
plications of the proposed approach range from cognitive neuroscience, where 
the problem is to identify changing associations between different parts of a 
brain when presented with different stimuli, to system biology studies, where 
the task is to identify changing patterns of interactions between genes involved 
in different cellular processes. We conjecture that our estimation procedure is 



Table 3: Performance on nearest neighbor networks 



Method name 


Precision 


Recall 


Fi score 


TD-Lasso 
Oracle procedure 


0.79 (0.06) 
0.87 (0.05) 


0.76 (0.05) 
0.82 (0.05) 


0.77 (0.05) 
0.84 (0.04) 
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Precision Recall 




Figure 5: Plots of the precision, recall and F% scores as functions of the penalty 
parameters Ai and A2 for nearest neighbor networks. The parameter Ai is ob- 
tained as 100 * 0.98 50+ \ where i indexes y-axis. The parameter A2 is computed 
as 285 * O.OS 230 ^ , where j indexes ir-axis. The white region of each plot corre- 
sponds to a region of the parameter space that we did not explore. 

also valid in the high-dimensional setting when the number of variables p is 
much larger than the sample size n. We leave the investigations of the rate of 
convergence in the high-dimensional setting for a future work. 

6 Proofs 

6.1 Proof of Lemma Q] 

For each ie [n], introduce a (p — l)-dimensional vector % defined as 

= f for i = 1 

1 f3. t i — f3. t i-i otherwise 



20 



and rewrite the objective (O as 



{"7*}ie[n] = argmin V^Xi.a - ^ £i,b Y]7.?V 

+ 2A 1 ^|| 7l || 2 + 2A 2 ^ XllE^I" 

i=2 i=l be\a j<i 

A necessary and sufficient condition for {7i}i e [„] to be a solution of (|T%|). is 
that for each k £ [n] the (p — l)-dimensional zero vector, 0, belongs to the 
subdifferential of (p~8|) with respect to 7& evaluated at {7i}ie[nl) that is, 

n n 

= 2^(-x iAo )(x i , a - ^o; < ,i ) 4 a ji ) + 2Aiz fc + 2A2^yi, (19) 

z— k b£\a i—k 

where z fc £ d\ \ ■ 1 1 2 ('Tfe ) , that is, 



i k = I llTfcll 



if % + 
£ 162(0, 1) otherwise 



and for k < i, y t £ <9| X}j<i 7jl> that is, y, = signQ^Tj) with sign(O) £ 
[— 1, 1]. The Lemma now simply follows from (|1£ 

6.2 Proof of Theorem [2] 



We bu ild on the ideas presented in the proof of Proposition 5 in lHarchaoui and Levv-Leduc 
( 201dh . Using the union bound, 



P[max \T 3 -f J \> n6 n ] < £ F[\Tj -f j \> n5 n ] 
MB] MB] 

and it is enough to show that P[\Tj —Tj\ > nS n ] — > for all j £ [B\. Define the 
event A n j as 

A n .j := {\Tj -Tj\ > nd„} 

and the event C n as 

C n :=(max|T J --T i | < ^ 

We show that PL4„,j] by showing that both PL4 nj n C„] -> and F[A n j D 
C ? C J -} as n -> 00. The idea here is that, in some sense, the event C n is a 
good event on which the estimated boundary partitions and the true boundary 
partitions are not too far from each other. Considering the two cases will make 
the analysis simpler. 

First, we show that ¥[A n j n C n ] — > 0. Without loss of generality, we assume 
that Tj < Tj, since the other case follows using the same reasoning. Using fj 15|) 
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twice with k = Tj and with k = Tj and then applying the triangle inequality 
we have 



2Ai > 



Tj-l 



Tj-l 



Tj-l 



^2 Xi,\a(x%,\a>P;i ~ P;i) ~ E ^i,\a^i + ^2 ^ y, 

i=fi 



=Tj i=T, 

Some algebra on the above display gives 



(20) 



2Ai + (Tj - Tfjy/pXi > 



Tj-l 

E 

i=f j 
Ti — l 



x <A „<x 4Ao ,^ +1 -^ +1 ; 



i=Ti 



Tj-l 
i=fi 



ll-Rllh — II-R2II2 — 1 1 1 1 2 - 



The above display occurs with probability one, so that the event {2Ai + (Tj — 

Tj)^p\ 2 > J||-Ri||2}U{||ik||2 > §||#i||a}U {\\R3W2 > k\\ R ih} also occurs 
with probability one, which gives us the following bound 

F[A n ,j n C„] < P[A,j nc„n {2Ai + (Tj - fj)^p\ 2 > ^\\Ri\\ 2 }] 

+ P[A n . J nc n n{\\R 2 \\ 2 > i||JJi|| 2 }] 

+ PK, J -nc n n{||i? 3 || 2 > ~\\RiM] 

=: P[A nijil ] + P[A n>jt 2 ] + PL4 nJ , 3 ]. 



First, we focus on the event A n j i. Using lemmaEl we can upper bound P[A 



with 



P[2Ai + (Tj - Tj)VpA 2 > ^(Tj - Tj^min] + 2 exp(-n5 n /2 + 2 logn). 

Since under the assumptions of the theorem (n£ n £ m in) _1 Ai — > and ^l n y/pX 2 — > 
as n — > oo, we have that P[v4 nj ,i] — > as n — > oo. 

Next, we show that the probability of the event A n j t2 converges to zero. 
Let Tj := |2~ 1 (T J - + T j+l )\. Observe that on the event C n , f j+1 > Tj so that 
/§.,i = for all i G [r j; 2}]. Using (JTSJ) with jfe = Tj and jfe = f 3 we have that 



2 Ax + (Tj - Tj)jp\ 2 > 



T,-l 



Ti — l 



Using lemma |S] on the display above we have 

36Ax + 18(2} - Tj-Jv^Aa + 18 



||^'+i_^+i|| 2 < 



(T j+ i - T 3 



(21) 
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which holds with probability at least 1 — 2 exp(— A min /4 + 2 log n). We will use 
the above bound to deal with the event {H-R2II2 > §11-^1 1 1 2}- Using lemma 13 
we have that ^ min (Tj-Tj)£ min /9 < \\R1W2 and ||i? 2 || 2 < (T i -T i )9& M x||0 , ' +1 - 
^■ J+1 ||2 with probability at least 1— 4 cxp(— n5 n /2+2 logn). Combining with ([21]) . 
the probability V[A n .^2] is upper bounded by 

P [ C l ( / , min^max A minCmi„ < Al] + P^Ln^mL&nin < \ZpA 2 ] + 

p [ c 30min0mLCmin < (Tj - ) — 1 1 1 ^2 x^e*^] + c 4 exp(-n<5„/2 + 21ogn). 

i=Tj 

Under the conditions of the theorem, the first term above converges to zero, 
since A m ; n > nd n and (n^„^ m i n ) _1 Ai — > 0. The second term also converges to 
zero, since Cmhi\/pA2 0. Using lemma [Jj the third term converges to zero 
with the rate exp(— ce logn), since (£ m mV 'A m i n ) _1 ^/plogn —> 0. Combining all 
the bounds, we have that PL4 nji2 ] — > as n — >• oo. 

Finally, we upper bound the probability of the event A n j^. As before, 
4>min(Tj — Tj)^ m i n /9 < ||-Ri||2 with probability at least 1 — 2 exp(— nS n /2 + 
2 logn). This gives us an upper bound on Pf^^j^] as 



(h ■ P ■ II HiLm x i,\a ei H 2 

Vmmsmm ^ 't—lj >\ 



T,-l 

x i,\a e i||2l 

2 exp(— nS n /2 + 2 logn) 



27 " 3j — Tj 

which, using lemma [JJ converges to zero as under the conditions of the theorem 
(£min Vn&n ) ~ 1 Vp log n — > 0. Thus we have shown that PL4 n ,j,3] — > 0. Since the 
case when Tj > Tj is shown similarly, we have proved that P[^4 n ,j H C„] — > as 
n — > 00 . 

We proceed to show that F[A n .j n C£] — > as n — > 00. Recall that C° = 
{maxj S [B] |Tj — Tj| > A m i n /2}. Define the following events 



{ajGfB], Tj < t,_i| n C£, 



^i m) := {vj e [fl], Tj_! < Tj < T i+1 } n C£ 



{Bje[B], % -,>T j+1 }nC c n 



and write PL4 n ,j n Q] = PL4 nJ n T>i°] + PL4 nJ n T>£ m) ] + P[A nJ n D^ ] ]. First, 
consider the event A n j f~l f« under the assumption that Tj < Tj. Due to 
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symmetry, the other case will follow in a similar way. Observe that 



< P{A n , 3 n {(f i+1 - Tj) > HFI n DM] 



T 

A min i 



»[{(T i+1 -f i+1 )>^}nDM] 
A 

< p[Aj n {(f i+x - Tj) > -f^} n dW] 
+ ^ P[{(r fc -f fe )>^}n{(T fc+1 -r fc )>^}ni?(™)]. 

fc=j+i 



We bound the first term in ([221) and note that the other terms can be bounded in 
the same way. The following analysis is performed on the event A n j n {(Tj + \ — 

Tj) > A min /2} n Z)£ to) . Using (JTSJ) with k = Tj and k = Tj, after some algebra 
(similar to the derivation of (|20p ) the following holds 

18Ai + 9(Tj - ffiy/pXi + 9|| Y^Lf 1 x i,\« e *H 
||0 J - J+1 || 2 < 



4>xoxxi{Tj - Tj) 

with probability at least 1—2 exp(— n<5„/2+2 logn), where we have used lemma[8l 
Let Tj = ^^{Tj + T J+1 )\. Using (J^J with fc = Z} and fc = 1} after some al- 
gebra (similar to the derivation of ([21]) ') we obtain the following bound 

Uai fli+lll ^ 18A l+ 9 (^-^)^ A 2+ 9 IIEK^\a^l|2 

IP — » I 2 < t= : 

<j>min[Tj ~ Tj) 

+ 810 max 0-f n ||^'-^' +1 || 2 , 

which holds with probability at least l — c\ exp(— n<5„/2 + 2 logn), where we have 
used lemma [8] twice. Combining the last two displays, we can upper bound the 
first term in ([2"2"j) with 

P[£min™$n < ClAl] + P[£ m in < C 2 ^pX 2 ] 



+ P^minV^n < C3 VP log n] + c 4 exp(-c 5 logn), 

where we have used lemma [7] to obtain the third term. Under the conditions 
of the theorem, all terms converge to zero. Reasoning similar about the other 
terms in ([2"2")l . we can conclude that P[^4n,j H D^] — > as n — > oo. 

Next, we bound the probability of the event A n j l~l Dn\ which is upper 
bounded by 

B 

nD ( n ] ] < 5]2^ 1 P[max{? e [B] : Tj < = j]. 
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Observe that 

{max{i G [B] : f, < T,_i} = j} 

c L){Tj - f, > ^p} n {f, +1 - Tj > 

so that we have 

P[D«] < 2 s - 1 £ ]Tp[{D — Ti> ^} n {T} +1 - t* > ^p>]. 

Using the same arguments as those used to bound terms in l|22p. we have that 
as rt — > oo under the conditions of the theorem. Similarly, we 
can show that the term P[_D£ r ^] — > as n — > oo. Thus, we have shown that 
P[A n j n CfJ —> 0, which concludes the proof. 



6.3 Proof of Lemma 3] 

Consider T fixed. The lemma is a simple consequence of the duality theory, 
which states that given the subdifferential (which is constant for all i G B\ 
& being an estimated block of the partition T), all solutions {/3-,i},e[nl °^ © 
need to satisfy the complementary slackness condition X)fce\a iM^M — 
which holds only if (3 b ^ = for all b G \a for which \yi^\ < 1. 



6.4 Proof of Theorem [5] 

Since the assumptions of theorem [2] are satisfied, we are going to work on the 
event 

£ := {max IT, — TA < nS n \. 
In this case, \B k \ = 0(n). For i G B k , we write 

X l: a = ^2 Xl ' b0 b + e * + £ * ( 23 ) 

besi 

where = J2bes x i,b{Pb,i — 9 k ) is the bias. Observe that Mi G B k (lB k , the bias 
e, = 0, while for i g" B n B k , the bias e, is normally distributed with variance 
bounded by M 2 max under the assumption Al and A3. 

We proceed to show that S(9 k ) C S k . Since fc is an optimal solution of 
it needs to satisfy 

(xf:)'xf: ( ^-^)-(xf:)' ( e^ + ^) 



+ A 1 (z ffc _ i -z^ + A 2 |#|y ffe _ i =0. 
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Now, we will construct the vectors 9 k , Zf k i , if k and y~f k i that satisfy (IM1) and 
verify that the subdifferential vectors are dual feasible. Consider the following 
restricted optimization problem 



min £ -X?>'||i 



2Ai^||0 i -^- 1 || 2 

j=2 



(25) 



2A 2 £|#|||^|| 1) 



where the vector 6^ k is constrained to be 0. Let {0 J '} 3 - e r& t> e a solution to the 
restricted optimization problem (|2"5j) . Set the subgradient vectors as Zf, 



T fc _i ^ 

d\\6 k - e k -% z Tk e <9||0 fc+1 - fc || and ff^s" = sign(0| fc ). Solve (EH for 
y^, ^fc. By construction, the vectors 9 k ,Zf k ,Zf and yf i satisfy (IM|) . 

Furthermore, the vectors Zrp and z^ are elements of the subdifferential, and 

-ifc-i -it 

hence dual feasible. To show that 9 k is also a solution to (fTT)) , we need to show 
that ||y^, jy-klloo < 1j that is, that y Tfc_1 is also dual feasible variable. Using 
lemma |H if we show that yf k i Nk is strict dual feasible, ||y^, jv*||oo < 1, 

then any other solution k to (fTT)) will satisfy 0^ = 0. 



From 



we can obtain an explicit formula for S k 



nk 

°s k 



((Xf^'xf:)" 1 (Ar(z f 



T k -!,S k 



"T k ,S k > 



(26) 



Recall that for large enough n we have that \B\ > p, so that the matrix 
(X| fc )'X| fc is invertible with probability one. Plugging (j2l))) into (|24l) . we have 
that ||yf fe j a": I loo < 1 if max bgA rfc < 1, where is defined to be 



n:=(xf) [x|:( ( x|:yxf:)" 



|S fc |A 2 



where H„ fc ' is the projection matrix 



H^ fc ,J ~ — I — Xf fc ( (X| fc )'x| fc 



|B fe |A 2 
|£ fe |A 2 



(27) 



Let S fc and £ be defined as 



1 



1 1 l £B fc 



7fr- X \a( X \a)'- 
1 1 i(LB k 
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For i € [n], we let B(i) index the block to which the sample i belongs to. Now, 
for any b £ N k , we can write x\ = (S^^ fc ) Xg fc +w l b where w\ is normally 
distributed with variance a\ < 1 and independent of x^ fc . Let Fb e Rl 23 *! be 
the vector whose components are equal to Efgl (Egig*) Xg fe , i G # fe , and 
W(, e RI b ''I be the vector with components equal to w\. Using this notation, 
we write Y b = + T b 2 + T b 3 + T b 4 where 



= FJXS ((Xgyxg)- 1 (y f _ + Al( ^: % ' Sfc) ) (28) 



J 6 - t b tl s k 



/ e B +e B \ 
V |6 fc |Ao / 



?6 = (W 6 



x l fc (( X I*)' X f fc 



Ai(z^ 



S k f k ,S k 



\B k \X 2 



H 



U /p8 _1_ «:B 



|i3 fc |A 2 



and 



T b 4 = - 



(29) 



(30) 



(31) 



\B k \\ 2 

We analyze each of the terms separately. Starting with the term T b , after some 
algebra, we obtain that 



j : B k r\Bi^$ ' ' 

+ ^&S<= ((^S fc S fc ) - (S^fcgfc) ) 

+ S bSfc (S| fcs . fc ) . 

J32) 

Recall that we are working on the event £ , so that |||S^ rs , s , s ,(S| f . sfc ) loo > 

|^» s »(E§ fcsfc ) _1 |«> and (l^ fc | A 2)- 1 Ai(z #)t _ i ^ -Zf fciSfc ) element- 

wise. Using (|37p we bound the first two terms in the equation above. We bound 
the first term by observing that for any j and any b € N k and n sufficiently 
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large 

\B 3 n gfc | i lv j rv j wi/vs^ne* v i 



Jfel 
|#|" 



— j#i — " sfcsfc ~ ^s^-s^l 00 - £l 

with probability 1 — ci exp(— C2 logn). Next, for any b € iV fc we bound the 
second term as 

~ k ~ k 1 ~ k 1 

H^bS fc (C^S k S k ) ~ C^S k S k ) )Hl 

<C a ||(i!^ s *)" 1 -(Sl*s*)" 1 )llF 

^ 1 1 V* ^ 1 1 2 1 1 ^ k -\^k II i /O f 1 1 V ^ ^ 1 1 2 \ 

^ ^2| 1^5656 \\F\\ Zj S k S k ~ Zj S k S k W F + u \\\ Zj S k S k " S k S k Wf) 

< £2 

with probability 1— Ci exp(— C2 logn). Choosing ei, 62 sufficiently small and for n 
large enough, we have that maxf, \T^\ < 1 — a + o p (l) under the assumption A4. 
We proceed with the term T h 2 , which can be written as 



^s k 

i£B k nB k 



n = i\B k \M)- 1 Uts k (^wr 1 - nxi: ((xgyxf:) -1 ) ^ 

+ (|^|A 2 r 1 J2 (sSi ) (sfii)" 1 -F' 6 x|:((x|:yx|:)" 1 )4 fe (e i +e i ). 



i<?B k nB k 

Since we are working on the event £ the second term in the above equation is 
dominated by the first term. Next, using (l32j) together with (|37|) . we have that 
for all b e N k 



IS* * (Su- 



fis' 1 \^S k S'" 



Combining with Lemma [71 we have that under the assumptions of the theorem 



max|r 6 2 | = o p (l) 




We deal with the term T 3 by conditioning on X| fc and e e , we have that Wt 
is independent of the terms in the squared bracket in T b 3 , since all if k i s , if k s 
and yf k s are determined from the solution to the restricted optimization 

problem. To bound the second term, we observe that conditional on X| fc and 



2N 



e B , the variance of T h 3 can be bounded as 



Var(T 3 ) < ||xg ((xfj 

<v' s * (cxgy 



r) sk 



(33) 



where 



|B|A 2 



Using lemma H] and Young's inequality, the first term in (J32J) is upper bounded 

by 

18 / 2A 2 \ 



with probability at least 1 — 2exp(— |S fe |/2 + 21ogn). Using lemma [6] we have 
that the second term is upper bounded by 

with probability at least 1 — exp(— ci\B k \S' 2 + 21ogn). Combining the two 
bounds, we have that Var(T b 3 ) < cis(|S' c |) _1 with high probability, using the 
fact that (|S fc |A 2 ) _1 A 1 —> and |S fe | A 2 —> oo as n —> oo. Using the bound on 
the variance of the term T 3 and the Gaussian tail bound, we have that 

max|T 6 3 | =o p (l). 

b£N F 

Combining the results, we have that max^^t |Yj| < 1 — a + o p (l). For a 
sufficiently large n, under the conditions of the theorem, we have shown that 
max heA r \Y b \ < 1 which implies that P[S(0 k ) C S k ] ^=^> i. 

Next, we proceed to show that V[S k C S(9 k )} ^=^> 1. Observe that 



[S k <£S(6 k )} <P[\\9 k sk -9 k s 



> 



From 



we have that \\0g k — 9g k \\oo is upper bounded by 



1 



i 



(X| fc )'X|i 



Ai(zf 



*T k ,S k ' 



^\B B \y fk _ uSk 



Since ^ only on i £ B k \B k and nS n /\B k \ — > 0, the term involving e B is 
stochastically dominated by the term involving e B and can be ignored. Define 
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the following terms 



\ -l 

1 '~rB k \/vB k \ 1 fv-B K \i^B K 



Conditioning on X^ fc , the term Tj is a | S ffe | dimensional Gaussian with variance 
bounded by c\/n with probability at least 1 — c% exp(— C2 logn) using lemma |SJ 
Combining with the Gaussian tail bound, the term | \T\ \ \ x can be upper bounded 

as 



IT II / l0 g S 



< c 2 exp(-c 3 logn). (34) 

n 

Using lemma[51 we have that with probability greater than l — ci exp(— c 2 log n) 
HTalU < ||T 2 || 2 < C3 -^-^0 

\B L \\2 

under the conditions of theorem. Similarly HTaH^ < Cis/s, with probability 
greater than 1 — c\ exp(— c 2 logn). Combining the terms, we have that 



|0*-e*IL<W— + <WSA a 



with probability at least 1 — C3 exp(— a logn). Since 9 m - m = fl(y/log(n)/n), we 
have shown that S C S(6 k ). Combining with the first part, it follows that 
S(9 k ) = S k with probability tending to one. 
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Appendix 

Technical results 

In this section we collect some technical results needed for the proves presented 
in H 
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Lemma 6. Let {C}ie[n] be a sequence of iid A/"(0, 1) random variables. If 
v n > Clogn, for some constant C > 16, then 



r 

□ {£(0) 2 <(1 + C)(r -; + !)} 



l<!<r-<n i—l 



> 1 — exp(— ci log n) 



for some constant c\ > 0. 

Proof. For any 1 < I < r < n, with r — I > v n we have 



P[E(C 1 ) 2 > (1 + Q(r ~l + l)]< ex P (-C(r - I + l)/8) 

< exp(-Clogn/8) 
using (|38[) . The lemma follows from an application of the union bound. 



□ 

Lemma 7. Let {xi}j g [„] 6e independent observations from ([1]) and let {ej}j 6 [ n ] 
6e independent 7V(0, 1). Assume that Al holds. If v n > Clogn for some 
constant C > 16, inen 

je[B] (,reBJ k i=i v ' ,T ' 

r-I>»„ 

> 1 — ci exp(-c 2 logn), 
for some constants C\,c% > 0. 

Proof. Let X 1 / 2 denote the symmetric square root of the covariance matrix Xgg 
and let B(i) denote the block £P of the true partition such that i £ BK With 

1/2 

this notation, we can write Xj = (X B W) u, where ~ 7V(0, 1). For any 
I < r e B J we have 



i=l 



i=l 



i=l 



Conditioning on {ej}j, for each b £ [p], 5^1=2 u i.ft e i is a normal random variable 
with variance Y!i=i( e i) 2 ■ Hence, ||X)i=j u i e i\\l/ (Yli=i( e i) 2 ) conditioned on {e^ 
is distributed according to % 2 and 



1 iiV 



f. 2 



> 



■'.i j I ax 



r-Z + 1 



Vp(1 + Clogn) {e,}[ = 



< > p(l + Clogn)] < exp(-Clogn/8), 

where the last inequality follows from f|38j> . Using lemma [6j for all l,r £ B' J 
with r — I > v n the quantity X)[=/( e i) 2 i s bounded by (1 + C)(r — I + 1) with 
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probability at least 1 — exp(— ci logn), which gives us the following bound 



e[Bl I,r€B3 V i = l 



je[B] i.reesi 



4 i/a 

fmax 

Vr - i + 



VT+C 



=-y/p(l + Clogn)} 



> 1 — ci exp(— C2 log n) . 



□ 



Lemma 8. Lei {xi} ig r n ] 6e independent observations from ([l}. Assume that 
Al holds. Then for any v n > p, 



and 



max A. 

l<!<r<n 
r-l>«„ 



min A min 

l<!<r<ra 



1 



r - 1 + 



Y X! Xi ( Xi )' ) - 9 ^max < 2n 2 exp(-u n /2) 



i=i 



1 



- ^x, (xi)' < mi „/9j < 2n 2 exp(-u„/2). 



r- I + 

Proof. For any 1 < I < r < n, with r — I > v n we have 
1 



A,. 



r-l + 



j^x t (x 4 )' ) > 90 max J < 2 exp(-(r - I + l)/2) 



< 2exp(-w„/2) 



using p5p. convexity of A max (-) and Al. The lemma follows from an application 
of the union bound. The other inequality follows using a similar argument. □ 

Proof of Proposition [3] 

The following proof follows main ideas already given in theorem [5] We provide 
only a sketch. 

Given an upper bound on the number of partitions -B m ax7 we are going to 
perform the analysis on the event {B < i? m ax}- Since 

Bmax 

P[h(f,T) > nS n | {B < B max }} < J2 F Ht,T) > nS n | {|T| = B' + 1}], 

B'=B 

we are going to focus on F[h(f,T) > n5 n | {|T| = B' + 1}] for B' > B (for 
B' = B it follows from theorem [2] that h(T,T) < n5 n with high probability). 
Let us define the following events 

Sj A = {31 G [B 1 ] : \fi - Tj\ > nS n , \f l+1 - T 3 \ > nS n and f, < Tj < f l+1 } 
£j,2 = {VZ G [£'] : \fi -TA> nS n and TJ < Tj} 
£j,3 = {VZ G : If; - Tj\ > nS n and f; > Tj}. 
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Using the above events, we have the following bound 

P[h(t,T) > nS n | {|f | =B' + 1}}< ]T Pfoi] + P[£ i)2 ] + P[£j, 3 ]. 

The probabilities of the above events can be bounded using the same reasoning 
as in the proof of theorem [5J by repeatedly using the KKT conditions given in 
(|15p . In particular, we can use the strategy used to bound the event A n j t 2- 
Since the proof is technical and does not reveal any new insight, we omit the 
details. 



A collection of known results 

This section collects some known results that we have used in the paper. We 
start by collecting some results on the eigenvalues of random matrices. Let 
x ~ A/"(0, S), i £ [n], and £ = n _1 ^Xi(xi)' be the empirical covariance 
matrix. Denote the elements of the covariance matrix S as [<7 a b] and of the 
empirical covariance matrix S as [a a b] ■ 

Using standard results on concentration of spectral norms and eigenvalues 
(Davidso n and Sz arck. l200lh JWainwrightl (|2009t) derives the following two crude 



bounds that can be very useful. Under the assumption that p < n, 

P[A max (S) > 90 max ] < 2exp(-n/2) (35) 
n(S) < 0mi„/9] < 2exp(-n/2). (36) 



From Lemma A. 3. in lBickel and Levinal ( 2008t ) we have the following bound 
on the elements of the covariance matrix 

V[\a ab ~ <Jab\ > e] < ci exp(-c 2 ne 2 ), |e| < e (37) 

where c\ and c 2 are positive constants that depend only on A max (S) and ep. 

Ne xt, we use the following tail bound for x 2 distribution from lLounici et al 
(2009), which holds for all e > 0, 



nxl >n + e]< cxp(-i min(e, I-)). (38) 
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